WiSe23/24
Humboldt-Universität zu Berlin
2023-10-10
library(broman)
# function to format p-values
format_pval <- function(pval){
dplyr::case_when(
pval < .001 ~ "< .001",
pval < .01 ~ "< .01",
pval < .05 ~ "< .05",
TRUE ~ broman::myround(pval, 3)
)
}
# round to nearest non-zero decimal
my_round = function(x, n=2) {
max(abs(round(x, n)), abs(signif(x, 1))) * sign(x)
}Today we will learn…
lm() functionMake sure you always start with a clean R Environment (Session > Restart R). This means you should have no objects stored in your Environment, and no packages loaded. To ensure this, you can go to the Session tab (up where you’ll find File, Help, etc.), and select Restart R. You can also use the keyboard shortcut Cmd/Ctrl+Shift+0 (that’s a zero, not an ‘oh’).
In addition, I often prefer to run options(scipen=999) in order to supress scientific notation, which writes very large or very small numbers in an unintuitive way. For example, 0.000005 is written 5e-06 in scientific notation.
We’ll also need to load in our required packages. Hopefully you’ve already install the required packages (if not, go to ?@sec-software).
\[ RT \sim frequency \]
read_csv() function from readrclean_names() function from the janitor package tidies up variable names (e.g., no spaces, all lower case).head()head() from base R to see the first 6 rows of our data# A tibble: 6 × 3
word freq rt
<chr> <dbl> <dbl>
1 thing 55522 622.
2 life 40629 520.
3 door 14895 507.
4 angel 3992 637.
5 beer 3850 587.
6 disgrace 409 705
word, freq, and rtsummary()summary() provides summaries of each variable in a dataframe word freq rt
Length:12 Min. : 4.0 Min. :507.4
Class :character 1st Qu.: 57.5 1st Qu.:605.2
Mode :character Median : 325.0 Median :670.8
Mean : 9990.2 Mean :679.9
3rd Qu.: 6717.8 3rd Qu.:771.2
Max. :55522.0 Max. :877.5
freq and rt look like?lm()lm() function fits simple linear models\[ lm(outcome \sim 1 + predictor,\;data\;=\;df\_name) \]
1 the intercept is still included in the formula1 with 0df_freq data:name <- value)Object naming
df in df_freq stand for ‘data frame’
fit_rt_1, using ‘fit’ to signal that this object is a model fit. You could also save it as mod_freq_1, lm_freq_1, or whatever you see fit (there are no rules)fig_freq or plot_freq
Call:
lm(formula = rt ~ 1, data = df_freq)
Coefficients:
(Intercept)
679.9
intercept and slope are called coefficients
Intercept?summary() function to print full model outputs.
Call:
lm(formula = rt ~ 1, data = df_freq)
Residuals:
Min 1Q Median 3Q Max
-172.537 -74.677 -9.137 91.296 197.613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 679.92 34.02 19.99 0.000000000538 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 117.8 on 11 degrees of freedom
broom package
The broom package has some useful functions for printing model outputs
tidy() produces a tibble (type of dataframe) of the coefficientsglance() produces goodness of fit measures (which we won’t discuss)The outputs from tidy() and glance() can be fed into kable and/or kable_styling() to create formatted tables
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 680. 34.0 20.0 5.38e-10
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 118. NA NA NA -73.7 151. 152.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
augment() adds model values as columns to your dataframe (e.g., useful for plotting observed vs. fitted values).
Call:
1lm(formula = rt ~ 1, data = df_freq)
Residuals:
Min 1Q Median 3Q Max
2-172.537 -74.677 -9.137 91.296 197.613
Coefficients:
3 Estimate Std. Error t value Pr(>|t|)
4(Intercept) 679.92 34.02 19.99 0.000000000538 ***
---
5Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
6Residual standard error: 117.8 on 11 degrees of freedomEstimates, standard error, t-value, p-value (Pr(>|t|))
rt ~ 1Figure 1: Visualisation of ‘rt ~ 1’: observed values (black) and mean (intercept; red). Residuals would be the distance from each black dot to the y-value of the read dot
rt) when we move 1-unit along \(y\) (IV: freq)
Call:
lm(formula = rt ~ freq, data = df_freq)
Residuals:
Min 1Q Median 3Q Max
-155.947 -73.141 2.117 85.050 163.837
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 713.706298 34.639105 20.60 0.0000000016 ***
freq -0.003382 0.001699 -1.99 0.0746 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 104.6 on 10 degrees of freedom
Multiple R-squared: 0.2838, Adjusted R-squared: 0.2121
F-statistic: 3.962 on 1 and 10 DF, p-value: 0.07457
freq
-0.003382289
rt) for a 1-unit change in \(x\) (our IV: freq)
1 2 3 4 5 6
525.9148 576.2873 663.3271 700.2042 700.6845 712.3229
[1] 621.77 519.56 507.38 636.56 587.18 705.00
1 2 3 4 5 6
95.855154 -56.727276 -155.947103 -63.644200 -113.504485 -7.322942
1 2 3 4 5 6
95.855154 -56.727276 -155.947103 -63.644200 -113.504485 -7.322942
(Intercept) freq
713.706297951 -0.003382289
(Intercept)
713.7063
ignore the (Intercept) label here, R just takes the first label when performing an operation on 2 vectors
what is the predicted reaction time for a word with frequency score of 5000?
(Intercept)
696.7949
Figure 2: Image source: Winter (2019) (all rights reserved)
augment() function from broom to append model values to our original data frame, and then feed this into ggplot() from ggplot2 (or even feed it into hist()).Higher-frequency words had longer reaction times, but this effect was not significant (\(\hat{\beta}\) = -0.003; t = -1.99).
# issue with mathmode in quarto; use unicode and fmt_markdown (https://kazuyanagimoto.com/blog/2022/12/27/quarto_tips/)
tidy(fit_rt_freq) |>
mutate(p.value = format_pval(p.value),
estimate = round(estimate,3),
std.error = round(std.error,3),
statistic = round(statistic,2)) |>
# rename(`ˆβ` = estimate) |>
gt() |>
cols_label(
term = "Coefficient",
estimate = "ˆβ",
std.error = "SE",
statistic = md("*t*"),
p.value = md("*p*")
) |>
fmt_markdown(columns = everything()) |>
fmt_number(decimals = 3,
drop_trailing_zeros = T) |>
tab_options(table.font.size = 36)| Coefficient | ˆβ | SE | t | p |
|---|---|---|---|---|
(Intercept) |
713.706 | 34.639 | 20.6 | < .001 |
freq |
−0.003 | 0.002 | −1.99 | 0.075 |
| R^2 | sigma | df | AIC | BIC |
|---|---|---|---|---|
| 0.284 | 104.595 | 1 | 149.469 | 150.924 |
# plot model predictions
fig_fit <-
sjPlot::plot_model(fit_rt_freq, type = "emm", terms = "freq") +
scale_y_continuous(limits = c(300,900))
fig_raw <-
df_freq |>
ggplot() +
aes(x = freq, y = rt) +
labs(title = "Observed values") +
geom_point() +
geom_smooth(method = "lm", colour = "red", fill = "pink") +
scale_y_continuous(limits = c(300,900))
library(patchwork)
fig_fit + fig_raw + plot_annotation(tag_levels = "A")Today we learned…
Now it’s your turn. Try to run the following lm() models: